Credits: Deep Learning, Ian Goodfellow
These are just my notes. I like writing.
3.1 Why Probability?
I'd thought I had a pretty decent understanding of the applications of probability, but this book gave me some non-obvious insights into why probability is so important to the study of machine learning and deep learning.
Probability gives us the ability to quantify and reason about uncertainty. But where does this uncertainty come from?
- Inherent stochasticity: Sometimes the system we are studying is just random by nature.
- Incomplete observability: Sometimes we just don't have access to all the information needed to determine the outcome of the system.
- Incomplete modeling: Some models may need to be simplified for various reasons. Simplifying a model usually means discarding some information. This loss of information leads to uncertainty in the model's predictions.
One virtue of uncertain rules and models is what I call the "effort to result ratio" (not a real term, but it helps me reason about this topic). The rule "Most birds fly" is pretty much true. The rule "Birds fly, except for very young birds that have not yet learned to fly, sick or injured birds that have lost the ability to fly, flightless species..." is absolutely true, but it takes a lot more "effort" to define.
"[This rule is] expensive to develop, maintain and communicate and, after all this effort, is still brittle and prone to failure."
For this reason, developing models isn't only about fidelity. Making a good choice often comes down to experience, intuition, and taste.
Views of probability
Frequentist probability
Under the frequentist view of probability, events and their outcomes are assumed to be reproducible. For example, the result of a given poker hand can be repeated, and the probability of winning that hand can be calculated as the proportion of wins.
Bayesian probability
Under the Bayesian view of probability, events and their outcomes are not assumed to be repeatable. Instead, the probability represents a degree of belief in the outcome. I.e., a probability of 1 represents absolute certainty in a given outcome.
Although we can define these two types of probability, as it turns out, the rules to reason about both types of probability are exactly the same (trust me bro). These definitions are just useful for interpreting the probability value.
3.2 Random Variables
A variable that is random, i.e. one that can take on different values randomly. The value of a random variable can be discrete or continuous.
3.3 Probability Distributions
Discrete RV
A description of how probable each outcome of a random variable is. The probability distribution of a discrete RV is typically called the probability mass function (PMF), which takes in a value (an outcome of the RV) and returns a number between 0 and 1: the probability of that outcome occurring.

Notation: $\mathrm{x} \sim P(\mathrm{x})$ means the RV $\mathrm{x}$ follows the distribution $P$.
It is not useful to get too caught up in the notation of these functions.
For a function $P$ to be a PMF of a RV $\mathrm{x}$:
- The domain of $P$ must be exactly the set of all possible states of $\mathrm{x}$.
- For any $x$, $P(x)$ is between 0 and 1.
- The sum of all values of $P$ is 1, i.e. $\sum_x P(x) = 1$, aka $P$ is normalized.
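A minimal Python sanity check of the last two properties, using a fair six-sided die as a toy PMF (my example, not the book's):

```python
# A toy PMF for a fair six-sided die (my example, not the book's).
pmf = {face: 1 / 6 for face in range(1, 7)}

# For any x, P(x) is between 0 and 1.
assert all(0 <= p <= 1 for p in pmf.values())

# The values of P sum to 1, i.e. P is normalized.
assert abs(sum(pmf.values()) - 1.0) < 1e-9
```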
Continuous RV
A similar function for a continuous RV is known as the probability density function (PDF).
For a function $p$ to be a PDF of a RV $\mathrm{x}$:
- The domain of $p$ must be the set of all possible states of $\mathrm{x}$.
- For any $x$ in the domain, $p(x) \geq 0$. Note: we do not require that $p(x)$ be less than or equal to 1.
- The integral of $p$ across its entire domain is 1, i.e. $\int p(x)\,dx = 1$.
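To see why a density value above 1 is fine, here's a small sketch (my own toy example): a uniform density on $[0, 0.5]$ takes the value 2 everywhere on its support, yet a crude Riemann sum over it still comes out to 1.

```python
# Uniform density on [0, 0.5]: p(x) = 2 on its support, a value above 1.
def p(x):
    return 2.0 if 0.0 <= x <= 0.5 else 0.0

# A crude Riemann sum over [0, 1] still integrates to ~1.
n = 100_000
dx = 1.0 / n
total = sum(p(i * dx) * dx for i in range(n))
print(total)  # ~1.0
```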
3.4 Marginal Probability
Joint Probability
A PDF or PMF can also be a function of multiple RVs. For example, $P(\mathrm{x}=x, \mathrm{y}=y)$ denotes the joint probability that $\mathrm{x}=x$ and $\mathrm{y}=y$ at the same time.

Now that we have a joint probability distribution, we can recover the marginal probability distribution of a subset of the RVs using the sum rule. For example, $P(\mathrm{x}=x) = \sum_y P(\mathrm{x}=x, \mathrm{y}=y)$ (with an integral instead of a sum in the continuous case).
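A quick sketch of the sum rule on a made-up 2x3 joint table:

```python
import numpy as np

# A made-up joint PMF P(x, y) as a 2x3 table (rows: x, columns: y).
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.05, 0.30]])

# Sum rule: marginalize a variable out by summing over it.
p_x = joint.sum(axis=1)  # P(x) = sum_y P(x, y) -> [0.40, 0.60]
p_y = joint.sum(axis=0)  # P(y) = sum_x P(x, y) -> [0.35, 0.25, 0.40]
```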
3.5 Conditional Probability
The conditional probability of $\mathrm{y}=y$ given $\mathrm{x}=x$ is $P(\mathrm{y}=y \mid \mathrm{x}=x) = \frac{P(\mathrm{y}=y, \mathrm{x}=x)}{P(\mathrm{x}=x)}$.

Note: the conditional probability is not defined when $P(\mathrm{x}=x) = 0$.
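Continuing the made-up joint table from above, the conditional distribution is just a slice of the joint, renormalized by the marginal:

```python
import numpy as np

# The same made-up joint table as in the marginalization sketch.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.05, 0.30]])

# P(y | x=0) = P(x=0, y) / P(x=0), defined only because P(x=0) > 0.
p_x0 = joint[0].sum()           # P(x=0) = 0.40
p_y_given_x0 = joint[0] / p_x0  # [0.25, 0.50, 0.25], sums to 1
```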
3.6 The Chain Rule of Conditional Probabilities
Using some algebra, we notice that $P(\mathrm{x}, \mathrm{y}) = P(\mathrm{y} \mid \mathrm{x})\,P(\mathrm{x})$.

Furthermore, we can generalize and treat any joint distribution as a chain of conditionals: $P(x^{(1)}, \ldots, x^{(n)}) = P(x^{(1)}) \prod_{i=2}^{n} P(x^{(i)} \mid x^{(1)}, \ldots, x^{(i-1)})$.
This is useful because, in many cases, the conditional probability of a variable is much easier to find than the "raw" joint probability. For example, in language models, the next predicted word is found by taking the word with the highest probability of occurring given the current phrase. This is easier than finding the entire sentence at once, which is analogous to finding the entire joint probability of all its constituent words.
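A rough sketch of that idea; the per-step conditional probabilities below are invented, standing in for what a real language model would output:

```python
import math

# Invented per-step conditionals P(w_i | w_1 .. w_{i-1}), as a language
# model would produce them; the numbers are made up for illustration.
step_probs = [
    0.20,  # P("the")
    0.05,  # P("cat" | "the")
    0.30,  # P("sat" | "the", "cat")
    0.60,  # P("down" | "the", "cat", "sat")
]

# Chain rule: the joint probability of the phrase is the product.
joint = math.prod(step_probs)  # 0.0018

# In practice we sum log-probabilities instead, to avoid underflow.
log_joint = sum(math.log(p) for p in step_probs)
print(joint, log_joint)
```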
3.7 Independence and Conditional Independence
Two RVs are independent if they have no effect on each other. Thus, $p(x, y) = p(x)\,p(y)$ for all $x$ and $y$.

Notated as $\mathrm{x} \perp \mathrm{y}$.

In some cases, two RVs are not independent on their own, but are conditionally independent given a third RV $\mathrm{z}$: $p(x, y \mid z) = p(x \mid z)\,p(y \mid z)$ for all $x$, $y$, and $z$.

Notated as $\mathrm{x} \perp \mathrm{y} \mid \mathrm{z}$.
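A small numpy sketch (toy marginals of my own choosing): when two RVs are independent, the joint table is exactly the outer product of its marginals, which gives a simple check.

```python
import numpy as np

# Toy marginals of my own choosing.
p_x = np.array([0.3, 0.7])
p_y = np.array([0.5, 0.25, 0.25])

# If x and y are independent, the joint is the outer product of marginals.
joint = np.outer(p_x, p_y)  # P(x, y) = P(x) P(y)

# Independence check: recompute the marginals and compare.
marg_x = joint.sum(axis=1)
marg_y = joint.sum(axis=0)
print(np.allclose(joint, np.outer(marg_x, marg_y)))  # True
```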
3.8 Expectation, Variance and Covariance
Expected Value
Let $f(x)$ be a function of a RV $\mathrm{x}$ with distribution $P$. The expected value of $f$ is its average value, weighted by probability: $\mathbb{E}_{\mathrm{x} \sim P}[f(x)] = \sum_x P(x) f(x)$ for a discrete RV, or $\int p(x) f(x)\,dx$ for a continuous one.

Treating "EV" as a function, we can say it is a "linear transformation" of its input: $\mathbb{E}[\alpha f(x) + \beta g(x)] = \alpha\,\mathbb{E}[f(x)] + \beta\,\mathbb{E}[g(x)]$.
Variance
The variance gives a measure of how "spread apart" the values of a function of a RV are.

Let $f$ be a function of a RV $\mathrm{x}$. Then $\mathrm{Var}(f(x)) = \mathbb{E}\left[\left(f(x) - \mathbb{E}[f(x)]\right)^2\right]$.
The intuition behind this is that we are taking the expected value of the (squared) difference between the value of $f$ and the expected value of $f$. Thus, if the expected distance between $f$ and $\mathbb{E}[f]$ is high, the values of $f$ are highly "spread apart".

The reason we square the difference between $f$ and $\mathbb{E}[f]$ is that we need all the values to be positive, BUT taking the absolute value instead has some awkward mathematical consequences (for one, it is not differentiable at zero).
Now, taking the square root of the variance gives the standard deviation $\sigma = \sqrt{\mathrm{Var}(f(x))}$. We have a distinct name for this value because it is used a lot.
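A minimal Monte Carlo sketch (distribution parameters are my own choice) estimating all three quantities from samples, directly following the definitions above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)  # samples of a RV

ev = x.mean()                 # E[x], ~2
var = ((x - ev) ** 2).mean()  # E[(x - E[x])^2], ~9
sigma = np.sqrt(var)          # standard deviation, ~3
print(ev, var, sigma)
```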
Covariance
The covariance of two variables measures how the deviation of one variable from its mean relates (linearly) to the deviation of the other: $\mathrm{Cov}(f(x), g(y)) = \mathbb{E}\big[\left(f(x) - \mathbb{E}[f(x)]\right)\left(g(y) - \mathbb{E}[g(y)]\right)\big]$.

This measure is affected by the scale of $f$ and $g$. We can "normalize" out this effect by dividing by the standard deviations of $f$ and $g$, giving the correlation of $f$ and $g$: $\rho = \frac{\mathrm{Cov}(f, g)}{\sigma_f\,\sigma_g}$.
This gives a "pure" measure of the correlation (lol) of two variables.
Relationship between Covariance/Correlation and Independence
Covariance and independence are related: two independent variables must have a covariance of 0, but a covariance of 0 does not mean that the two variables are independent.
This is because covariance is a purely linear measurement, and variables can still be dependent via a nonlinear relation.
This notion can also be applied to correlation and independence.
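The classic counterexample, sketched in numpy: $y = x^2$ is completely determined by $x$, yet their covariance is (approximately) zero because the relationship is purely nonlinear.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1_000_000)
y = x ** 2  # y is completely determined by x...

cov = ((x - x.mean()) * (y - y.mean())).mean()
print(cov)  # ...yet the covariance is ~0: the dependence is nonlinear
```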
Covariance Matrix
Let $\mathbf{x} \in \mathbb{R}^n$ be a random vector.

An $n \times n$ matrix $\mathrm{Cov}(\mathbf{x})$ is the covariance matrix of $\mathbf{x}$, defined such that $\mathrm{Cov}(\mathbf{x})_{i,j} = \mathrm{Cov}(x_i, x_j)$.

Notice that the diagonal of this matrix is the variance of each element: $\mathrm{Cov}(\mathbf{x})_{i,i} = \mathrm{Var}(x_i)$.
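A short numpy sketch (the true covariance entries are made up) confirming that the diagonal of an empirical covariance matrix recovers the per-element variances:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "true" covariance for a 3-dimensional random vector.
true_cov = [[1.0, 0.5, 0.0],
            [0.5, 2.0, 0.3],
            [0.0, 0.3, 1.5]]
samples = rng.multivariate_normal(mean=[0.0, 0.0, 0.0],
                                  cov=true_cov,
                                  size=100_000).T  # rows = variables

C = np.cov(samples)  # empirical covariance matrix
print(np.diag(C))    # ~[1.0, 2.0, 1.5]: diagonal entries are variances
```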